CUSTOMER CHURN PREDICTION


Loading libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split,cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, roc_curve, recall_score, confusion_matrix, precision_score
from sklearn.metrics import f1_score, classification_report, r2_score, auc

Loading the dataset

In [3]:
df=pd.read_csv("Telco_Customer_Churn.csv")
copied_df=df.copy()

Understanding the data

In [4]:
pd.options.display.max_columns=21
df.head()
Out[4]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

checking size of the data

In [5]:
df.shape
Out[5]:
(7043, 21)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
In [7]:
df.columns.values
Out[7]:
array(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'Churn'], dtype=object)
In [8]:
df.columns=df.columns.str.lower()
df.columns.values
Out[8]:
array(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection',
       'techsupport', 'streamingtv', 'streamingmovies', 'contract',
       'paperlessbilling', 'paymentmethod', 'monthlycharges',
       'totalcharges', 'churn'], dtype=object)
In [9]:
df.dtypes
Out[9]:
customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

visualizing missing values

In [10]:
msno.matrix(df)
Out[10]:
<Axes: >

the visualization above shows no null values

  • however, totalcharges is stored as object and holds 11 blank strings, which only show up as missing once the column is converted to a numeric type
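A minimal sketch of why blank strings escape a null check, assuming a hypothetical three-row frame (only the column name mirrors the notebook). A whitespace-only string is a valid object value, so it is only counted as missing after numeric coercion:

```python
import pandas as pd

# Hypothetical three-row frame; only the column name mirrors the notebook.
toy = pd.DataFrame({"totalcharges": ["29.85", " ", "108.15"]})

# A blank string is a valid (non-null) value, so isnull() reports nothing.
nulls_before = int(toy["totalcharges"].isnull().sum())

# Counting whitespace-only strings exposes the hidden gap.
blanks = int(toy["totalcharges"].str.strip().eq("").sum())

# errors="coerce" turns anything unparsable, including " ", into NaN.
converted = pd.to_numeric(toy["totalcharges"], errors="coerce")
nulls_after = int(converted.isnull().sum())

print(nulls_before, blanks, nulls_after)
```

Running the same blank count on the real column before conversion would be one way to confirm the figure of 11.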

Dropping the customerid

In [11]:
df.drop(['customerid'],axis=1,inplace=True)
df.head()
Out[11]:
gender seniorcitizen partner dependents tenure phoneservice multiplelines internetservice onlinesecurity onlinebackup deviceprotection techsupport streamingtv streamingmovies contract paperlessbilling paymentmethod monthlycharges totalcharges churn
0 Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No
2 Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
In [12]:
df['totalcharges']=pd.to_numeric(df['totalcharges'],errors='coerce')
df.isnull().sum()
Out[12]:
gender               0
seniorcitizen        0
partner              0
dependents           0
tenure               0
phoneservice         0
multiplelines        0
internetservice      0
onlinesecurity       0
onlinebackup         0
deviceprotection     0
techsupport          0
streamingtv          0
streamingmovies      0
contract             0
paperlessbilling     0
paymentmethod        0
monthlycharges       0
totalcharges        11
churn                0
dtype: int64
In [13]:
df[(df['totalcharges'].isna())]
Out[13]:
gender seniorcitizen partner dependents tenure phoneservice multiplelines internetservice onlinesecurity onlinebackup deviceprotection techsupport streamingtv streamingmovies contract paperlessbilling paymentmethod monthlycharges totalcharges churn
488 Female 0 Yes Yes 0 No No phone service DSL Yes No Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 NaN No
753 Male 0 No Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 NaN No
936 Female 0 Yes Yes 0 Yes No DSL Yes Yes Yes No Yes Yes Two year No Mailed check 80.85 NaN No
1082 Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 NaN No
1340 Female 0 Yes Yes 0 No No phone service DSL Yes Yes Yes Yes Yes No Two year No Credit card (automatic) 56.05 NaN No
3331 Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 19.85 NaN No
3826 Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.35 NaN No
4380 Female 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.00 NaN No
5218 Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service One year Yes Mailed check 19.70 NaN No
6670 Female 0 Yes Yes 0 Yes Yes DSL No Yes Yes Yes Yes No Two year No Mailed check 73.35 NaN No
6754 Male 0 No Yes 0 Yes Yes DSL Yes Yes No Yes No No Two year Yes Bank transfer (automatic) 61.90 NaN No
In [14]:
# blank_spaces=df.applymap(lambda x:x ==' ')
# blank_spaces.sum()

checking for blank spaces

checking rows with a tenure of 0

In [15]:
df[df['tenure']==0]
Out[15]:
gender seniorcitizen partner dependents tenure phoneservice multiplelines internetservice onlinesecurity onlinebackup deviceprotection techsupport streamingtv streamingmovies contract paperlessbilling paymentmethod monthlycharges totalcharges churn
488 Female 0 Yes Yes 0 No No phone service DSL Yes No Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 NaN No
753 Male 0 No Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 NaN No
936 Female 0 Yes Yes 0 Yes No DSL Yes Yes Yes No Yes Yes Two year No Mailed check 80.85 NaN No
1082 Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 NaN No
1340 Female 0 Yes Yes 0 No No phone service DSL Yes Yes Yes Yes Yes No Two year No Credit card (automatic) 56.05 NaN No
3331 Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 19.85 NaN No
3826 Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.35 NaN No
4380 Female 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.00 NaN No
5218 Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service One year Yes Mailed check 19.70 NaN No
6670 Female 0 Yes Yes 0 Yes Yes DSL No Yes Yes Yes Yes No Two year No Mailed check 73.35 NaN No
6754 Male 0 No Yes 0 Yes Yes DSL Yes Yes No Yes No No Two year Yes Bank transfer (automatic) 61.90 NaN No
In [16]:
print(df[df['tenure']==0].count())
gender              11
seniorcitizen       11
partner             11
dependents          11
tenure              11
phoneservice        11
multiplelines       11
internetservice     11
onlinesecurity      11
onlinebackup        11
deviceprotection    11
techsupport         11
streamingtv         11
streamingmovies     11
contract            11
paperlessbilling    11
paymentmethod       11
monthlycharges      11
totalcharges         0
churn               11
dtype: int64
  • only 11 rows have a tenure of 0, and they are exactly the rows with missing totalcharges, so they can be dropped
In [17]:
df.drop(labels=df[df['tenure']==0].index,axis=0,inplace=True)
df[df['tenure']==0].index
Out[17]:
Index([], dtype='int64')
In [18]:
df.isnull().sum()
Out[18]:
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64
  • the null values in the totalcharges column were dropped along with the tenure 0 rows
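The claim that the missing totalcharges rows and the tenure 0 rows coincide can be checked explicitly before dropping anything. The sketch below uses a hypothetical four-row frame and boolean masks; it is an alternative formulation, not the notebook's own code:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame: tenure 0 rows have no total charge yet.
toy = pd.DataFrame({
    "tenure": [1, 0, 34, 0],
    "totalcharges": [29.85, np.nan, 1889.50, np.nan],
})

zero_tenure = toy["tenure"].eq(0)
missing_total = toy["totalcharges"].isna()

# True only when the two masks select exactly the same rows.
same_rows = bool((zero_tenure == missing_total).all())

# Boolean indexing returns a new frame instead of mutating in place.
cleaned = toy.loc[~zero_tenure].reset_index(drop=True)
remaining_nulls = int(cleaned["totalcharges"].isna().sum())

print(same_rows, len(cleaned), remaining_nulls)
```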

brief descriptive summary

In [19]:
df.describe()
Out[19]:
seniorcitizen tenure monthlycharges totalcharges
count 7032.000000 7032.000000 7032.000000 7032.000000
mean 0.162400 32.421786 64.798208 2283.300441
std 0.368844 24.545260 30.085974 2266.771362
min 0.000000 1.000000 18.250000 18.800000
25% 0.000000 9.000000 35.587500 401.450000
50% 0.000000 29.000000 70.350000 1397.475000
75% 0.000000 55.000000 89.862500 3794.737500
max 1.000000 72.000000 118.750000 8684.800000
In [20]:
df['seniorcitizen']=df.seniorcitizen.replace({0:'No',1:'Yes'})
df.head()
Out[20]:
gender seniorcitizen partner dependents tenure phoneservice multiplelines internetservice onlinesecurity onlinebackup deviceprotection techsupport streamingtv streamingmovies contract paperlessbilling paymentmethod monthlycharges totalcharges churn
0 Female No Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 Male No No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 Male No No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 Male No No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 Female No No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
In [21]:
df.describe()
Out[21]:
tenure monthlycharges totalcharges
count 7032.000000 7032.000000 7032.000000
mean 32.421786 64.798208 2283.300441
std 24.545260 30.085974 2266.771362
min 1.000000 18.250000 18.800000
25% 9.000000 35.587500 401.450000
50% 29.000000 70.350000 1397.475000
75% 55.000000 89.862500 3794.737500
max 72.000000 118.750000 8684.800000
In [22]:
df.describe(include =['object'])
Out[22]:
gender seniorcitizen partner dependents phoneservice multiplelines internetservice onlinesecurity onlinebackup deviceprotection techsupport streamingtv streamingmovies contract paperlessbilling paymentmethod churn
count 7032 7032 7032 7032 7032 7032 7032 7032 7032 7032 7032 7032 7032 7032 7032 7032 7032
unique 2 2 2 2 2 3 3 3 3 3 3 3 3 3 2 4 2
top Male No No No Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check No
freq 3549 5890 3639 4933 6352 3385 3096 3497 3087 3094 3472 2809 2781 3875 4168 2365 5163
In [23]:
fig,ax=plt.subplots(1,2,figsize=(10,10))
gender_count=df.gender.value_counts()
churn_count=df.churn.value_counts()
# take the labels from each Series index so every slice is paired with its own count
ax[0].pie(gender_count,autopct='%0.1f%%',labels=gender_count.index,startangle=90, shadow=True, wedgeprops={'width':0.6})
ax[1].pie(churn_count,autopct='%0.1f%%',labels=churn_count.index,startangle=90, shadow=True, wedgeprops={'width':0.6})

plt.show()
  • 26.6 % of customers switched to another firm.
  • 49.5 % of customers are female and 50.5 % are male.
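The percentages can be recomputed from the counts already printed in the notebook (3549 male customers out of 7032, and 5163 versus 1869 for churn). The helper name `shares` is made up for this sketch:

```python
# Counts taken from the notebook outputs: gender (describe) and churn (In [25], In [27]).
gender_counts = {"Male": 3549, "Female": 7032 - 3549}
churn_counts = {"No": 5163, "Yes": 1869}

def shares(counts):
    """Return each category's share of the total, in percent, rounded to 1 decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

gender_pct = shares(gender_counts)
churn_pct = shares(churn_counts)
print(gender_pct, churn_pct)
```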
In [24]:
df["churn"][df["churn"]=="No"].groupby(by=df["gender"]).count()
Out[24]:
gender
Female    2544
Male      2619
Name: churn, dtype: int64
In [25]:
churn_no_count=df["churn"][df["churn"]=="No"].groupby(by=df["gender"]).count().sum()
churn_no_count
Out[25]:
5163
In [26]:
df["churn"][df["churn"]=="Yes"].groupby(by=df.gender).count()
Out[26]:
gender
Female    939
Male      930
Name: churn, dtype: int64
In [27]:
churn_yes_count=df["churn"][df["churn"]=="Yes"].groupby(by=df.gender).count().sum()
churn_yes_count
Out[27]:
1869
In [28]:
plt.figure(figsize=(4,4))
labels =["Churn: Yes","Churn:No"]
values = [1869,5163]
labels_gender = ["F","M","F","M"]
sizes_gender = [939,930 , 2544,2619]
colors = ['#ff6666', '#66b3ff']
# colors=['pink','lightblue']
colors_gender = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
explode = (0.3,0.3) 
explode_gender = (0.1,0.1,0.1,0.1)
textprops = {"fontsize":15}
#Plot
plt.pie(values, labels=labels,autopct='%1.1f%%',pctdistance=1.08, labeldistance=0.8,colors=colors, startangle=90,frame=True, explode=explode,radius=10, textprops =textprops, counterclock = True, )
plt.pie(sizes_gender,labels=labels_gender,colors=colors_gender,startangle=90, explode=explode_gender,radius=7, textprops =textprops, counterclock = True, )
#Draw circle
centre_circle = plt.Circle((0,0),5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.title('Churn Distribution w.r.t Gender: Male(M), Female(F)', fontsize=15, y=1.1)

# show plot 
 
plt.axis('equal')
plt.tight_layout()
plt.show()
  • The share of customers who changed service provider is almost identical for both genders, so gender appears to have little influence on churn.
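To put a number on the gender comparison, the churn rate per gender can be derived from the groupby counts shown above (In [24] and In [26]). This is a small worked calculation, not part of the original analysis:

```python
# Counts from the groupby outputs above (In [24] and In [26]).
stayed = {"Female": 2544, "Male": 2619}
churned = {"Female": 939, "Male": 930}

# Churn rate = churned / (churned + stayed), in percent.
churn_rate = {
    g: round(100 * churned[g] / (churned[g] + stayed[g]), 1)
    for g in stayed
}
gap = abs(churn_rate["Female"] - churn_rate["Male"])
print(churn_rate, gap)
```

The two rates differ by less than one percentage point.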
In [29]:
fig=px.histogram(df, x=df.contract,color='contract',title='<b>Customer contract distribution</b>')
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
In [30]:
fig=px.histogram(df, x=df.churn,color='contract',barmode='group',title='<b>Customer churn within contract distribution</b>')
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

what is the percentage of churn by contract type
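One way to answer this is a row-normalized cross-tabulation. The sketch below runs on hypothetical toy rows, so the printed percentages are not the dataset's; on the real data the same call would take `df["contract"]` and `df["churn"]`:

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would pass df["contract"] and df["churn"].
toy = pd.DataFrame({
    "contract": ["Month-to-month"] * 4 + ["One year"] * 2 + ["Two year"] * 2,
    "churn":    ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
})

# normalize="index" divides each row by its row total, giving per-contract shares.
churn_pct = pd.crosstab(toy["contract"], toy["churn"], normalize="index") * 100
print(churn_pct.round(1))
```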

In [31]:
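# labels and values come from the same Series so each slice keeps its own label
payment_counts = df['paymentmethod'].value_counts()
labels = payment_counts.index
values = payment_counts.values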

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_layout(title_text="<b>Payment Method Distribution</b>")
fig.show()
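When a pie chart is labelled from a column, `unique()` and `value_counts()` do not share an ordering: the first follows order of appearance, the second sorts by descending frequency. The hypothetical series below shows how pairing them by position can attach the wrong label to a slice:

```python
import pandas as pd

# Hypothetical column: "Mailed check" appears first but is the rarest value.
methods = pd.Series(["Mailed check", "Electronic check", "Electronic check",
                     "Credit card", "Credit card", "Credit card"])

appearance_order = list(methods.unique())   # order of first appearance
counts = methods.value_counts()             # sorted by descending frequency

# Pairing the two positionally labels the largest slice "Mailed check".
mispaired = dict(zip(appearance_order, counts.values))

# Labels and values taken from the same Series always agree.
paired = dict(zip(counts.index, counts.values))

print(mispaired)
print(paired)
```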
In [32]:
fig = px.histogram(df, x="churn", color="paymentmethod",barmode='group', title="<b>Customer Payment Method distribution w.r.t. Churn</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
  • Most customers who churned paid by electronic check.
  • Customers who paid by automatic credit card transfer, automatic bank transfer, or mailed check were less likely to churn.
In [33]:
df['internetservice'].unique()
Out[33]:
array(['DSL', 'Fiber optic', 'No'], dtype=object)
In [34]:
df[df["gender"]=="Male"][["internetservice", "churn"]].value_counts()
Out[34]:
internetservice  churn
DSL              No       992
Fiber optic      No       910
No               No       717
Fiber optic      Yes      633
DSL              Yes      240
No               Yes       57
Name: count, dtype: int64
In [35]:
df[df["gender"]=="Female"][["internetservice", "churn"]].value_counts()
Out[35]:
internetservice  churn
DSL              No       965
Fiber optic      No       889
No               No       690
Fiber optic      Yes      664
DSL              Yes      219
No               Yes       56
Name: count, dtype: int64
In [36]:
fig = go.Figure()

fig.add_trace(go.Bar(
  x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
       ["Female", "Male", "Female", "Male"]],
  y = [965, 992, 219, 240],
  name = 'DSL',
))

fig.add_trace(go.Bar(
  x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
       ["Female", "Male", "Female", "Male"]],
  y = [889, 910, 664, 633],
  name = 'Fiber optic',
))

fig.add_trace(go.Bar(
  x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
       ["Female", "Male", "Female", "Male"]],
  y = [690, 717, 56, 57],
  name = 'No Internet',
))
fig.update_layout(title_text="<b>Churn Distribution w.r.t. Internet Service and Gender</b>")

fig.show()
  • Fiber optic is the most popular internet service (3,096 customers) and also has the highest churn rate, about 42%, which might suggest dissatisfaction with this type of service.
  • DSL has fewer customers (2,416) and a much lower churn rate, about 19%, than fiber optic.
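The customer totals and churn rates per internet service follow directly from the two value_counts outputs above (In [34] and In [35]). A short worked calculation, summing the male and female counts:

```python
# Counts summed over both genders from the value_counts outputs (In [34], In [35]).
stayed = {"DSL": 992 + 965, "Fiber optic": 910 + 889, "No": 717 + 690}
churned = {"DSL": 240 + 219, "Fiber optic": 633 + 664, "No": 57 + 56}

totals = {s: stayed[s] + churned[s] for s in stayed}
rates = {s: 100 * churned[s] / totals[s] for s in stayed}

for service in stayed:
    print(f"{service:12s} customers={totals[service]:5d} churn={rates[service]:4.1f}%")
```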

In [37]:
color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig = px.histogram(df, x="churn", color="dependents", barmode="group", title="<b>Dependents distribution</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
  • customers without dependents are more likely to churn
In [38]:
color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig = px.histogram(df, x="churn", color="partner", barmode="group", title="<b>Churn distribution w.r.t. Partners</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
  • customers who don't have partners are more likely to churn
In [39]:
color_map = {"Yes": '#00CC96', "No": '#B6E880'}
fig = px.histogram(df, x="churn", color="seniorcitizen",barmode='group', title="<b>Churn distribution w.r.t. Senior Citizen</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
  • senior citizens are a small minority of customers (1,142 of 7,032, about 16%)
  • almost half of the senior citizens churn
In [40]:
color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig = px.histogram(df, x="churn", color="onlinesecurity", barmode="group", title="<b>Churn w.r.t Online Security</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
  • Most customers who churn do not have online security
In [41]:
color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig = px.histogram(df, x="churn", color="paperlessbilling",  title="<b>Churn distribution w.r.t. Paperless Billing</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
  • Customers with Paperless Billing are more likely to churn.
In [42]:
fig = px.histogram(df, x="churn", color="techsupport",barmode="group",  title="<b>Churn distribution w.r.t. TechSupport</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
  • Customers who did not have TechSupport are more likely to migrate to another service provider.
In [43]:
color_map = {"Yes": '#00CC96', "No": '#B6E880'}
fig = px.histogram(df, x="churn", color="phoneservice", title="<b>Churn distribution w.r.t. Phone Service</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
  • few customers lack phone service, and only a small fraction of them churn.
In [44]:
sns.set_context('paper',font_scale=1.1)
ax=sns.kdeplot(df.monthlycharges[df.churn=='No'],color='red',fill=True)
ax=sns.kdeplot(df.monthlycharges[df.churn=='Yes'],color='blue',fill=True)
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('Monthly Charges');
ax.set_title('Distribution of monthly charges by churn');
In [45]:
df.head()
Out[45]:
gender seniorcitizen partner dependents tenure phoneservice multiplelines internetservice onlinesecurity onlinebackup deviceprotection techsupport streamingtv streamingmovies contract paperlessbilling paymentmethod monthlycharges totalcharges churn
0 Female No Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 Male No No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 Male No No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 Male No No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 Female No No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
In [46]:
ax = sns.kdeplot(df.totalcharges[(df["churn"] == 'No') ],
                color="Gold", fill = True);
ax = sns.kdeplot(df.totalcharges[(df["churn"] == 'Yes') ],
                ax =ax, color="Green", fill= True);
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('Total Charges');
ax.set_title('Distribution of total charges by churn');
In [47]:
fig = px.box(df, x='churn', y = 'tenure')

# Update yaxis properties
fig.update_yaxes(title_text='Tenure (Months)')
# Update xaxis properties
fig.update_xaxes(title_text='Churn')

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family='Courier'),
    title='<b>Tenure vs Churn</b>',
)

fig.show()
  • New customers are more likely to churn
In [48]:
plt.figure(figsize=(25, 10))

corr = df.apply(lambda x: pd.factorize(x)[0]).corr()

mask = np.triu(np.ones_like(corr, dtype=bool))

ax = sns.heatmap(corr, mask=mask, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, linewidths=.2, cmap='coolwarm', vmin=-1, vmax=1)
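One caveat with factorizing every column: `pd.factorize` assigns codes in order of first appearance, so for numeric columns such as tenure the codes no longer reflect magnitude. A possible variant, sketched here with a hypothetical helper `encode_categoricals` and a toy frame, encodes only the non-numeric columns:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def encode_categoricals(frame):
    """Factorize only non-numeric columns; numeric columns keep their values."""
    encoded = frame.copy()
    for name in encoded.columns:
        if not is_numeric_dtype(encoded[name]):
            encoded[name] = pd.factorize(encoded[name])[0]
    return encoded

# Hypothetical toy frame reusing two of the notebook's column names.
toy = pd.DataFrame({"tenure": [30, 10, 20], "churn": ["No", "Yes", "No"]})

# Factorizing a numeric column replaces values with order-of-appearance codes.
codes_all = toy.apply(lambda x: pd.factorize(x)[0])
codes_cat = encode_categoricals(toy)

print(codes_all["tenure"].tolist(), codes_cat["tenure"].tolist())
```

Calling `.corr()` on the second frame would keep the numeric columns on their original scale, which may change some of the coefficients in the heatmap.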

data preprocessing

we need to convert all the nominal categorical data to numerical data

In [49]:
X=df.drop('churn',axis=1)
y=df.churn
In [50]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
In [51]:
col=['gender','seniorcitizen','partner','dependents','phoneservice','multiplelines','internetservice','onlinesecurity','onlinebackup','deviceprotection','techsupport','streamingtv','streamingmovies','contract','paperlessbilling','paymentmethod']
transformer = ColumnTransformer(transformers=[
    ('tnf1',OneHotEncoder(sparse_output=False,drop='first'),col)  # sparse_output replaces sparse in scikit-learn >= 1.2
],remainder='passthrough')
In [52]:
X_train_transformed=transformer.fit_transform(X_train)
X_test_transformed=transformer.transform(X_test)
In [53]:
le=LabelEncoder()
y_train_transformed=le.fit_transform(y_train)
y_test_transformed=le.transform(y_test)
In [54]:
X_train_transformed.shape
Out[54]:
(5625, 30)
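The 30 columns can be accounted for from the unique-value counts in the categorical summary (Out[22]): with `drop='first'`, a feature with k categories yields k - 1 indicator columns, and the three numeric columns pass through unchanged. The check below assumes every category also occurs in the training split:

```python
# Unique-value counts per categorical column, read off df.describe(include=['object']).
unique_counts = {
    'gender': 2, 'seniorcitizen': 2, 'partner': 2, 'dependents': 2,
    'phoneservice': 2, 'multiplelines': 3, 'internetservice': 3,
    'onlinesecurity': 3, 'onlinebackup': 3, 'deviceprotection': 3,
    'techsupport': 3, 'streamingtv': 3, 'streamingmovies': 3,
    'contract': 3, 'paperlessbilling': 2, 'paymentmethod': 4,
}
numeric_columns = ['tenure', 'monthlycharges', 'totalcharges']

# drop='first' keeps (k - 1) indicator columns for a feature with k categories.
encoded_columns = sum(k - 1 for k in unique_counts.values())
total_columns = encoded_columns + len(numeric_columns)
print(encoded_columns, total_columns)
```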
In [55]:
X_train.shape
Out[55]:
(5625, 19)
In [56]:
y_train.shape
Out[56]:
(5625,)
In [57]:
y_train_transformed.shape
Out[57]:
(5625,)
In [58]:
le.classes_
Out[58]:
array(['No', 'Yes'], dtype=object)
In [59]:
y_train_transformed
Out[59]:
array([1, 1, 1, ..., 0, 0, 1])
In [60]:
scaler=StandardScaler()
X_train_transformed=scaler.fit_transform(X_train_transformed)
X_test_transformed=scaler.transform(X_test_transformed)

model training, prediction and Evaluation

KNN

In [61]:
knn=KNeighborsClassifier(n_neighbors=12)
cv_score=cross_val_score(knn,X_train_transformed,y_train_transformed,cv=5)
print('cross validation score:', cv_score)
print('Mean cv accuracy:', np.mean(cv_score))
cross validation score: [0.80266667 0.792      0.75733333 0.78222222 0.8       ]
Mean cv accuracy: 0.7868444444444445
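Alongside the mean, the spread across folds indicates how stable the estimate is. A small sketch using the five fold scores printed above and the standard library only:

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above.
cv_scores = [0.80266667, 0.792, 0.75733333, 0.78222222, 0.8]

cv_mean = mean(cv_scores)
cv_std = stdev(cv_scores)   # sample standard deviation across the 5 folds
print(f"accuracy = {cv_mean:.3f} +/- {cv_std:.3f}")
```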
In [62]:
knn.fit(X_train_transformed,y_train_transformed)
knn_pred=knn.predict(X_test_transformed)
knn_accuracy=accuracy_score(y_test_transformed,knn_pred)
In [63]:
print(classification_report(y_test_transformed, knn_pred))
              precision    recall  f1-score   support

           0       0.82      0.88      0.85      1033
           1       0.60      0.48      0.53       374

    accuracy                           0.78      1407
   macro avg       0.71      0.68      0.69      1407
weighted avg       0.76      0.78      0.77      1407
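To see how the report's figures for the churn class (label 1) fit together, the sketch below recomputes them from confusion-matrix counts. The four counts are illustrative values chosen to agree with the rounded report; they are not read from the notebook's actual confusion matrix:

```python
# Illustrative counts only: chosen to be consistent with the rounded report above,
# not read from the notebook's actual confusion matrix.
tp, fn = 180, 194    # churners caught / missed (support 374)
fp, tn = 120, 913    # false alarms / correctly retained (support 1033)

precision = tp / (tp + fp)          # of predicted churners, how many churned
recall = tp / (tp + fn)             # of actual churners, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

A recall near 0.48 means roughly half of the actual churners are missed, which overall accuracy alone does not reveal.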

Random forest

In [64]:
rf=RandomForestClassifier()
rf.fit(X_train_transformed,y_train_transformed)
Out[64]:
RandomForestClassifier()
In [65]:
# make prediction
rf_pred=rf.predict(X_test_transformed)
rf_accuracy=accuracy_score(y_test_transformed,rf_pred)
print(rf_accuracy)
0.7846481876332623
In [66]:
print(classification_report(y_test_transformed,rf_pred))
              precision    recall  f1-score   support

           0       0.82      0.90      0.86      1033
           1       0.63      0.47      0.54       374

    accuracy                           0.78      1407
   macro avg       0.73      0.69      0.70      1407
weighted avg       0.77      0.78      0.77      1407

In [67]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test_transformed,rf_pred),annot=True,fmt='d',linecolor='k',linewidth=1,cmap='copper')
plt.title("Random Forest Confusion Matrix",fontsize=14)
plt.show()
In [68]:
rf_pred_prob = rf.predict_proba(X_test_transformed)[:,1]
fpr_rf, tpr_rf, thresholds = roc_curve(y_test_transformed, rf_pred_prob)
plt.plot([0, 1], [0, 1], 'k--' )
plt.plot(fpr_rf, tpr_rf, label='Random Forest',color = "r")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC Curve',fontsize=16)
plt.show();
In [69]:
# Youden's J statistic: pick the threshold that maximises TPR - FPR
optimal_idx = np.argmax(tpr_rf - fpr_rf)
optimal_threshold = thresholds[optimal_idx]
print("Optimal threshold is:", optimal_threshold)
Optimal threshold is: 0.31
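The threshold above is the one that maximises TPR - FPR (Youden's J statistic). The sketch below reimplements that selection with NumPy on hypothetical labels and scores, then applies the cut-off in place of the default 0.5; the function name `youden_threshold` is made up for illustration. Note that a threshold chosen on the test set gives an optimistic estimate, so a separate validation split would be the safer place to tune it:

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = TPR - FPR over the observed scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    positives = np.sum(y_true == 1)
    negatives = np.sum(y_true == 0)
    best_t, best_j = None, -1.0
    for t in np.unique(scores):
        predicted = scores >= t
        tpr = np.sum(predicted & (y_true == 1)) / positives
        fpr = np.sum(predicted & (y_true == 0)) / negatives
        if tpr - fpr > best_j:
            best_t, best_j = float(t), float(tpr - fpr)
    return best_t, best_j

# Hypothetical labels and predicted churn probabilities.
y_true = [0, 0, 0, 1, 1, 0, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.9]

threshold, j = youden_threshold(y_true, scores)

# Apply the chosen cut-off instead of the default 0.5 used by predict().
labels_at_threshold = (np.asarray(scores) >= threshold).astype(int)
labels_at_default = (np.asarray(scores) >= 0.5).astype(int)
print(threshold, j, labels_at_threshold, labels_at_default)
```

In the notebook the equivalent step would compare `rf_pred_prob` against `optimal_threshold`; lowering the cut-off trades extra false positives for higher recall on churners.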
In [70]:
auc_rf = auc(fpr_rf, tpr_rf)

Gradient Boosting classifier

In [71]:
gb = GradientBoostingClassifier()
gb.fit(X_train_transformed, y_train_transformed)
gb_pred = gb.predict(X_test_transformed)
print("Gradient Boosting Classifier", accuracy_score(y_test_transformed, gb_pred))
Gradient Boosting Classifier 0.7896233120113717
In [72]:
print(classification_report(y_test_transformed, gb_pred))
              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1033
           1       0.64      0.48      0.55       374

    accuracy                           0.79      1407
   macro avg       0.73      0.69      0.71      1407
weighted avg       0.78      0.79      0.78      1407

In [73]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test_transformed, gb_pred),
                annot=True,fmt = "d",linecolor="k",linewidths=3,cmap='hot')
    
plt.title("Gradient Boosting Classifier Confusion Matrix",fontsize=14)
plt.show()
In [74]:
gb_pred_prob = gb.predict_proba(X_test_transformed)[:,1]
fpr_gb, tpr_gb, thresholds = roc_curve(y_test_transformed, gb_pred_prob)
plt.plot([0, 1], [0, 1], 'k--' )
plt.plot(fpr_gb, tpr_gb, label='Gradient boosting',color = "r")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Gradient Boosting ROC Curve',fontsize=16)
plt.show();
In [75]:
# Youden's J statistic: pick the threshold that maximises TPR - FPR
optimal_idx = np.argmax(tpr_gb - fpr_gb)
optimal_threshold = thresholds[optimal_idx]
print("Optimal threshold is:", optimal_threshold)
Optimal threshold is: 0.28522236073425244
In [76]:
auc_gb = auc(fpr_gb, tpr_gb)
In [77]:
# Plot ROC curves
plt.figure(figsize=(10, 6))
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.2f})')
plt.plot(fpr_gb, tpr_gb, label=f'Gradient Boosting (AUC = {auc_gb:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.grid()
plt.show()
  • Gradient Boosting performs slightly better than Random Forest on the test set (accuracy 0.790 vs. 0.785)

Conclusion

Customer churn negatively impacts a firm's profitability.

Various strategies can be employed to mitigate customer churn:

  • Deeply understand customers to prevent churn.
  • Identify customers at risk of leaving and enhance their satisfaction.
  • Prioritize improvements in customer service.
  • Foster customer loyalty through personalized experiences and specialized services.
  • Survey customers who have already left to understand their reasons for leaving.
  • Adopt a proactive approach to prevent future churn.